## PCT

09/337,839

# WORLD INTELLECTUAL PROPERTY ORGANIZATION



INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(51) International Patent Classification 6: WO 00/22503 (11) International Publication Number: G06F 7/50, 7/52, 15/16, 15/80, 17/10, A1 (43) International Publication Date: 20 April 2000 (20.04.00) 17/14

PCT/HS99/23494 (81) Designated States: CA, CN, IL, JP, KR, MX, European patent (21) International Application Number: (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT,

LU, MC, NL, PT, SE). (22) International Filing Date: 8 October 1999 (08,10,99)

(30) Priority Data: Published 60/103.712 9 October 1998 (09.10.98) US With international search report. Before the expiration of the time limit for amending the TIC

claims and to be republished in the event of the receipt of amendments. (71) Applicant; BOPS INCORPORATED [US/US]; Suite 210, 6340 Quadrangle Drive, Chapel Hill, NC 27514 (US).

(72) Inventors: PITSIANIS, Nikos, P.; Apartment 11, 6205 Far-rington Road, Chapel Hill, NC 27514 (US). PECHANEK, Gerald, G.; 107 Stoneleigh Drive, Cary, NC 27511 (US). RODRIGUEZ, Ricardo, E.; 9116 Erinsbrook Drive, Raleigh, NC 27613 (US).

22 June 1999 (22,06,99)

(74) Agent: PRIEST, Peter, H.; Law Offices of Peter H. Priest, 529 Dogwood Drive, Chapel Hill, NC 27516 (US).

(54) Title: EFFICIENT COMPLEX MULTIPLICATION AND FAST FOURIER TRANSFORM (FFT) IMPLEMENTATION ON THE MANARRAY ARCHITECTURE

#### (57) Abstract

Efficient computation of complex multiplication results and very efficient fast Fourier transforms (FFTs) are provided. A parallel array VLIW digital signal processor (100) is employed along with specialized complex multiplication instructions and communication operations between the processing elements (101, 151, 153, 155) which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs (100) are used allowing the complex multiplication pipeline hardware to be efficiently used. In addition, efficient techniques for supporting combined multiply accumulate operations are described.



## FOR THE PURPOSES OF INFORMATION ONLY

Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.

| AL | Albania                  | ES  | Spain               | LS | Lesotho               | SI | Slovenia                 |
|----|--------------------------|-----|---------------------|----|-----------------------|----|--------------------------|
| AM | Armenia                  | FI  | Finland             | LT | Lithuania             | SK | Slovakia                 |
| AT | Austria                  | FR  | Prance              | LU | Luxembourg            | SN | Senegal                  |
| AU | Australia                | GA  | Gabon               | LV | Latvia                | SZ | Swaziland                |
| AZ | Azerbaijan               | GB  | United Kingdom      | MC | Monaco                | TD | Chad                     |
| BA | Bosnia and Herzegovina   | GE  | Georgia             | MD | Republic of Moldova   | TG | Togo                     |
| BB | Barbados                 | CH  | Ghans               | MG | Madagascar            | TJ | Tajikistan               |
| BE | Belgium                  | GN  | Guinea              | MK | The former Yugoslav   | TM | Turkmenistan             |
| BF | Burkina Faso             | GR  | Greece              |    | Republic of Macedonia | TR | Turkey                   |
| BG | Bulgaria                 | HU  | Hungary             | ML | Muli                  | TT | Trinidad and Tobago      |
| BJ | Benin                    | 112 | Ireland             | MN | Mongolia              | UA | Ukraine                  |
| BR | Brazil                   | ΙL  | Israel              | MR | Mauritania            | UG | Uganda                   |
| BY | Belarus                  | IS  | Iceland             | MW | Malawi                | US | United States of America |
| CA | Canada                   | IT  | Italy               | MX | Mexico                | UZ | Uzbekistan               |
| CF | Central African Republic | JP  | Japan               | NE | Niger                 | VN | Viet Nam                 |
| CG | Congo                    | KE  | Kenya               | NL | Netherlands           | YU | Yugoslavia               |
| CH | Switzerland              | KG  | Kyrgyzstan          | NO | Norway                | zw | Zimbabwe                 |
| CI | Côte d'Ivoire            | KP  | Democratic People's | NZ | New Zealand           |    |                          |
| CM | Cameroon                 |     | Republic of Korea   | PL | Poland                |    |                          |
| CN | China                    | KR  | Republic of Korea   | PT | Portugal              |    |                          |
| CU | Cuba                     | KZ  | Kazakstan           | RO | Romania               |    |                          |
| CZ | Czech Republic           | LC  | Saint Lucia         | RU | Russian Federation    |    |                          |
| DE | Germany                  | LI  | Liechtenstein       | SD | Sudan                 |    |                          |
| DK | Denmark                  | LK  | Sri Lanka           | SE | Sweden                |    |                          |
| EE | Estonia                  | LR  | Liberia             | SG | Singapore             |    |                          |
|    |                          |     |                     |    |                       |    |                          |

WO 00/22503 PCT/US99/23494

### Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray Architecture

This application claims the benefit of U.S. Provisional Application Serial No. 60/103,712 filed October 9, 1998 which is incorporated by reference in its entirety herein. Field of the Invention

The present invention relates generally to improvements to parallel processing, and more particularly to methods and apparatus for efficiently calculating the result of a complex multiplication. Further, the present invention relates to the use of this approach in a very efficient FFT implementation on the manifold array ("ManArray") processing architecture. Background of the Invention

The product of two complex numbers x and y is defined to be  $z = x_R y_R - x_I y_I + i(x_R y_I + x_I y_R)$ , where  $x = x_R + ix_h y = y_R + iy_I$  and i is an imaginary number, or the square root of negative one, with  $t^2 = -I$ . This complex multiplication of x and y is calculated in a variety of contexts, and it has been recognized that it will be highly advantageous to perform this calculation faster and more efficiently.

### Summary of the Invention

5

20

30

The present invention defines hardware instructions to calculate the product of two complex numbers encoded as a pair of two fixed-point numbers of 16 bits each in two cycles with single cycle pipeline throughput efficiency. The present invention also defines extending a series of multiply complex instructions with an accumulate operation. These special instructions are then used to calculate the FFT of a vector of numbers efficiently.

A more complete understanding of the present invention, as well as other features and advantages of the invention will be apparent from the following Detailed Description and the accompanying drawings.

## Brief Description of the Drawings

- Fig. 1 illustrates an exemplary 2x2 ManArray iVLIW processor;
- Fig. 2A illustrates a presently preferred multiply complex instruction, MPYCX;
- Fig. 2B illustrates the syntax and operation of the MPYCX instruction of Fig. 2A;
  Fig. 3A illustrates a presently preferred multiply complex divide by 2 instruction,
  MPYCXD2:
  - Fig. 3B illustrates the syntax and operation of the MPYCXD2 instruction of Fig. 3A:

25

Fig. 4A illustrates a presently preferred multiply complex conjugate instruction, MPYCXI:

Fig. 4B illustrates the syntax and operation of the MPYCXJ instruction of Fig. 4A:

Fig. 5A illustrates a presently preferred multiply complex conjugate divide by two instruction, MPYCXJD2;

Fig. 5B illustrates the syntax and operation of the MPYCXJD2 instruction of Fig. 5A;

Fig. 6 illustrates hardware aspects of a pipelined multiply complex and its divide by two instruction variant:

Fig. 7 illustrates hardware aspects of a pipelined multiply complex conjugate, and its divide by two instruction variant;

Fig. 8 shows an FFT signal flow graph;

Fig. 9A-9H illustrate aspects of the implementation of a distributed FFT algorithm on a 2x2 ManArray processor using a VLIW algorithm with MPYCX instructions in a cycle-bycycle sequence with each step corresponding to operations in the FFT signal flow graph;

Fig. 9I illustrates how multiple iterations may be tightly packed in accordance with the present invention for a distributed FFT of length four;

Fig. 9J illustrates how multiple iterations may be tightly packed in accordance with the present invention for a distributed FFT of length two;

Figs. 10A and 10B illustrate Kronecker Product examples for use in reference to the mathematical presentation of the presently preferred distributed FFT algorithm;

Fig. 11A illustrates a presently preferred multiply accumulate instruction, MPYA;

Fig. 11B illustrates the syntax and operation of the MPYA instruction of Fig. 11A;

Fig. 12A illustrates a presently preferred sum of 2 products accumulate instruction, SUM2PA:

Fig. 12B illustrates the syntax and operation of the SUM2PA instruction of Fig. 12A;

Fig. 13A illustrates a presently preferred multiply complex accumulate instruction, MPYCXA:

 $Fig.\ 13B\ illustrates\ the\ syntax\ and\ operation\ of\ the\ MPYCXA\ instruction\ of\ Fig.\ 13A;$ 

Fig. 14A illustrates a presently preferred multiply complex accumulate divide by two instruction. MPYCXAD2:

Fig. 14B illustrates the syntax and operation of the MPYCXAD2 instruction of Fig. 14A; Fig. 15A illustrates a presently preferred multiply complex conjugate accumulate instruction. MPYCXJA:

Fig. 15B illustrates the syntax and operation of the MPYCXJA instruction of Fig. 15A;

Fig. 16A illustrates a presently preferred multiply complex conjugate accumulate divide by two instruction, MPYCXJAD2:

Fig. 16B illustrates the syntax and operation of the MPYCXJAD2 instruction of Fig. 16A:

Fig. 17 illustrates hardware aspects of a pipelined multiply complex accumulate and its divide by two variant; and

Fig. 18 illustrates hardware aspects of a pipelined multiply complex conjugate accumulate and its divide by two variant.

#### **Detailed Description**

Further details of a presently preferred ManArray architecture for use in conjunction with the present invention are found in U.S. Patent Application Serial No. 08/885,310 filed June 30, 1997, U.S. Patent Application Serial No. 08/949,122 filed October 10, 1997, U.S. Patent Application Serial No. 09/169,255 filed October 9, 1998, U.S. Patent Application Serial No. 09/169,256 filed October 9, 1998, U.S. Patent Application Serial No. 09/169,072 filed October 9, 1998, U.S. Patent Application Serial No. 09/187,539 filed November 6, 20 1998, U.S. Patent Application Serial No. 09/205,558 filed December 4, 1998, U.S. Patent Application Serial No. 09/215,081 filed December 18, 1998, U.S. Patent Application Serial No. 09/228,374 filed January 12, 1999, U.S. Patent Application Serial No. 09/238,446 filed January 28, 1999, U.S. Patent Application Serial No. 09/267,570 filed March 12, 1999, as well as, Provisional Application Serial No. 60/092,130 entitled "Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors" filed July 9, 1998, Provisional Application Serial No. 60/103,712 entitled "Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray" filed October 9, 1998. Provisional Application Serial No. 60/106,867 entitled "Methods and Apparatus for Improved Motion Estimation for Video Encoding" filed November 3, 1998, Provisional Application Serial No. 60/113,637 entitled "Methods and Apparatus for Providing Direct Memory Access (DMA) Engine" filed December 23, 1998 and Provisional Application Serial No. 60/113,555 entitled "Methods and Apparatus Providing Transfer Control" filed

December 23, 1998, respectively, and incorporated by reference herein in their entirety.

WO 00/22503 PCT/US99/23494

In a presently preferred embodiment of the present invention, a ManArray 2x2 IVLIW single instruction multiple data stream (SIMD) processor 100 shown in Fig. 1 contains a controller sequence processor (SP) combined with processing element-0 (PE0) SP/PE0 101, as described in further detail in U.S. Application Serial No. 09/169,072 entitled "Methods and Apparatus for Dynamically Merging an Array Controller with an Array Processing Element". Three additional PEs 151, 153, and 155 are also utilized to demonstrate the implementation of efficient complex multiplication and fast fourier transform (FFT) computations on the ManArray architecture in accordance with the present invention. It is noted that the PEs can be also labeled with their matrix positions as shown in parentheses for PE0 (PE00) 101, PE1 (PE01) 151, PE2 (PE10) 153, and PE3 (PE11) 155.

The SP/PE0 101 contains a fetch controller 103 to allow the fetching of short instruction words (SIWs) from a 32-bit instruction memory 105. The fetch controller 103 provides the typical functions needed in a programmable processor such as a program counter (PC), branch capability, digital signal processing, EP loop operations, support for interrupts, and also provides the instruction memory management control which could include an instruction cache if needed by an application. In addition, the SIW I-Fetch controller 103 dispatches 32-bit SIWs to the other PEs in the system by means of a 32-bit instruction bus 102.

In this exemplary system, common elements are used throughout to simplify the explanation, though actual implementations are not so limited. For example, the execution 20 units 131 in the combined SP/PE0 101 can be separated into a set of execution units optimized for the control function, e.g. fixed point execution units, and the PE0 as well as the other PEs 151, 153 and 155 can be optimized for a floating point application. For the purposes of this description, it is assumed that the execution units 131 are of the same type in the SP/PE0 and the other PEs. In a similar manner, SP/PE0 and the other PEs use a five instruction slot iVLIW architecture which contains a very long instruction word memory (VIM) memory 109 and an instruction decode and VIM controller function unit 107 which receives instructions as dispatched from the SP/PE0's I-Fetch unit 103 and generates the VIM addresses-and-control signals 108 required to access the iVLIWs stored in the VIM. These iVLIWs are identified by the letters SLAMD in VIM 109. The loading of the iVLIWs is described in further detail in U.S. Patent Application Serial No. 09/187,539 entitled "Methods and Apparatus for Efficient Synchronous MIMD Operations with iVLIW PE-to-PE Communication". Also contained in the SP/PE0 and the other PEs is a common PE

configurable register file 127 which is described in further detail in U.S. Patent Application Serial No. 09/169,255 entitled "Methods and Apparatus for Dynamic Instruction Controlled Reconfiguration Register File with Extended Precision".

Due to the combined nature of the SP/PE0, the data memory interface controller 125 must handle the data processing needs of both the SP controller, with SP data in memory 121, and PEO, with PEO data in memory 123. The SP/PEO controller 125 also is the source of the data that is sent over the 32-bit broadcast data bus 126. The other PEs 151, 153, and 155 contain common physical data memory units 123', 123", and 123" though the data stored in them is generally different as required by the local processing done on each PE. The interface to these PE data memories is also a common design in PEs 1, 2, and 3 and indicated by PE local memory and data bus interface logic 157, 157 and 157". Interconnecting the PEs for data transfer communications is the cluster switch 171 more completely described in U.S. Patent Application Serial No. 08/885,310 entitled "Manifold Array Processor", U.S. Application Serial No. 09/949,122 entitled "Methods and Apparatus for Manifold Array Processing", and U.S. Application Serial No. 09/169,256 entitled "Methods and Apparatus for ManArray PE-to-PE Switch Control". The interface to a host processor, other peripheral devices, and/or external memory can be done in many ways. The primary mechanism shown for completeness is contained in a direct memory access (DMA) control unit 181 that provides a scalable ManArray data bus 183 that connects to devices and interface units external to the ManArray core. The DMA control unit 181 provides the data flow and bus arbitration mechanisms needed for these external devices to interface to the ManArray core memories via the multiplexed bus interface represented by line 185. A high level view of a ManArray Control Bus (MCB) 191 is also shown.

All of the above noted patents are assigned to the assignee of the present invention and incorporated herein by reference in their entirety. Special Instructions for Complex Multiply

Turning now to specific details of the ManArray processor as adapted by the present invention, the present invention defines the following special hardware instructions that execute in each multiply accumulate unit (MAU), one of the execution units 131 of Fig. 1 and in each PE, to handle the multiplication of complex numbers:

MPYCX instruction 200 (Fig. 2A), for multiplication of complex numbers, where the
complex product of two source operands is rounded according to the rounding mode
specified in the instruction and loaded into the target register. The complex numbers

are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCX instruction format is shown in Fig. 2A. The syntax and operation description 210 is shown in Fig. 2B.

- MPYCXD2 instruction 300 (Fig. 3A), for multiplication of complex numbers, with
  the results divided by 2, Fig. 3, where the complex product of two source operands is
  divided by two, rounded according to the rounding mode specified in the instruction,
  and loaded into the target register. The complex numbers are organized in the source
  register such that halfword H1 contains the real component and halfword H0 contains
  the imaginary component. The MPYCXD2 instruction format is shown in Fig. 3A.
   The syntax and operation description 310 is shown in Fig. 3B.
  - MPYCXJ instruction 400 (Fig. 4A), for multiplication of complex numbers where the
    second argument is conjugated, where the complex product of the first source operand
    times the conjugate of the second source operand, is rounded according to the
    rounding mode specified in the instruction and loaded into the target register. The
    complex numbers are organized in the source register such that halfword H1 contains
    the real component and halfword H0 contains the imaginary component. The
    MPYCXJ instruction format is shown in Fig. 4A. The syntax and operation
- MPYCXJD2 instruction 500 (Fig. 5A), for multiplication of complex numbers where
  the second argument is conjugated, with the results divided by 2, where the complex
  product of the first source operand times the conjugate of the second operand, is
  divided by two, rounded according to the rounding mode specified in the instruction
  and loaded into the target register. The complex numbers are organized in the source
  register such that halfword H1 contains the real component and halfword H0 contains
  the imaginary component. The MPYCXJD2 instruction format is shown in Fig. 5A.
  The syntax and operation description 510 is shown in Fig. 5B.

description 410 is shown in Fig. 4B.

All of the above instructions 200, 300, 400 and 500 complete in 2 cycles and are pipeline-able. That is, another operation can start executing on the execution unit after the first cycle. All complex multiplication instructions return a word containing the real and imaginary part of the complex product in half words H1 and H0 respectively. WO 00/22503 PCT/US99/23494

To preserve maximum accuracy, and provide flexibility to programmers, four possible rounding modes are defined:

- · Round toward the nearest integer (referred to as ROUND)
- Round toward 0 (truncate or fix, referred to as TRUNC)

5

- Round toward infinity (round up or ceiling, the smallest integer greater than or equal to the argument, referred to as CEIL)
- Round toward negative infinity (round down or floor, the largest integer smaller than or equal to the argument, referred to as FLOOR).

Hardware suitable for implementing the multiply complex instructions is shown in Fig. 6 and Fig. 7. These figures illustrate a high level view of the hardware 10 apparatus 600 and 700 appropriate for implementing the functions of these instructions. This hardware capability may be advantageously embedded in the ManArray multiply accumulate unit (MAU), one of the execution units 131 of Fig. 1 and in each PE, along with other hardware capability supporting other MAU instructions. As a pipelined operation, the first execute cycle begins with a read of the source register operands from the compute register file (CRF) shown as registers 603 and 605 in Fig. 6 and as registers 111, 127, 127", 127", and 127" in Fig. 1. These register values are input to the MAU logic after some operand access delay in halfword data paths as indicated to the appropriate multiplication units 607, 609, 611, and 613 of Fig. 6. The outputs of the multiplication operation units, XR\*YR 607, XR\*Y1 20 609, X<sub>1</sub>\*Y<sub>R</sub> 611, and X<sub>1</sub>\*Y<sub>1</sub>613, are stored in pipeline registers 615, 617, 619, and 621. respectively. The second execute cycle, which can occur while a new multiply complex instruction is using the first cycle execute facilities, begins with using the stored pipeline register values, in pipeline register 615, 617, 619, and 621, and appropriately adding in adder 625 and subtracting in subtractor 623 as shown in Fig. 6. The add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction. The values generated by the apparatus 600 shown in Fig. 6 contain a maximum precision of calculation which exceeds 16-bits. Consequently, the appropriate bits must be selected and rounded as indicated in the instruction before storing the final results. The selection of the bits and rounding occurs in selection and rounder circuit 627. The two 16-bit rounded results are then stored in the appropriate halfword position of the target register 629 which is located in the compute register file (CRF). The divide by two variant of the multiply complex instruction 300 selects a different set of bits as

specified in the instruction through block 627. The hardware 627 shifts each data value right by an additional 1-bit and loads two divided-by-2 rounded and shifted values into each half word position in the target registers 629 in the CRF.

The hardware 700 for the multiply complex conjugate instruction 400 is shown in Fig. 7. The main difference between multiply complex and multiply complex conjugate is in adder 723 and subtractor 725 which swap the addition and subtraction operation as compared with Fig. 6. The results from adder 723 and subtractor 725 still need to be selected and rounded in selection and rounder circuit 727 and the final rounded results stored in the target register 729 in the CRF. The divide by two variant of the multiply complex conjugate instruction 500 selects a different set of bits as specified in the instruction through selection and rounder circuit 727. The hardware of circuit 727 shifts each data value right by an additional 1-bit and loads two divided-by-2 rounded and shifted values into each half word position in the target registers 729 in the CRF.

### The FFT Algorithm

15

20

The power of indirect VLIW parallelism using the complex multiplication instructions is demonstrated with the following fast Fourier transform (FFT) example. The algorithm of this example is based upon the sparse factorization of a discrete Fourier transform (DFT) matrix. Kronecker-product mathematics is used to demonstrate how a scalable algorithm is created.

The Kronecker product provides a means to express parallelism using mathematical notation. It is known that there is a direct mapping between different tensor product forms and some important architectural features of processors. For example, tensor matrices can be created in parallel form and in vector form. J. Granata, M. Conner, R. Tolimieri, The Tensor Product: A Mathematical Programming Language for FFTs and other Fast DSP Operations, 25 IEEE SP Magazine, January 1992, pp. 40 - 48. The Kronecker product of two matrices is a block matrix with blocks that are copies of the second argument multiplied by the corresponding element of the first argument. Details of an exemplary calculation of matrix vector products

$$v = (I_m \otimes A)x$$

are shown in Fig. 10A. The matrix is block diagonal with m copies of A. If vector x was distributed block-wise in m processors, the operation can be done in parallel without any

communication between the processors. On the other hand, the following calculation, shown in detail in Fig. 10B.

$$v = (A \otimes I_m)x$$

requires that x be distributed physically on m processors for vector parallel computation.

The two Kronecker products are related via the identity

$$I \otimes A = P(A \otimes I)P^T$$

where P is a special permutation matrix called stride permutation and  $P^T$  is the transpose permutation matrix. The stride permutation defines the required data distribution for a parallel operation, or the communication pattern needed to transform block distribution to evelic and vice-versa.

The mathematical description of parallelism and data distributions makes it possible to conceptualize parallel programs, and to manipulate them using linear algebra identities and thus better map them onto target parallel architectures. In addition, Kronecker product notation arises in many different areas of science and engineering. The Kronecker product simplifies the expression of many fast algorithms. For example, different FFT algorithms correspond to different sparse matrix factorizations of the Discrete Fourier Transform (DFT), whose factors involve Kronecker products. Charles F. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992, pp 78-80.

The following equation shows a Kronecker product expression of the FFT algorithm, based on the Kronecker product factorization of the DFT matrix.

$$F_n = (F_p \otimes I_m)D_{p,m}(I_p \otimes F_m)P_{n,p}$$
where:

n is the length of the transform

p is the number of PEs

m = n/p

25

The equation is operated on from right to left with the  $P_{n,p}$  permutation operation occurring first. The permutation directly maps to a direct memory access (DMA) operation that specifies how the data is to be loaded in the PEs based upon the number of PEs p and length of the transform n.

$$F_n = (F_p \otimes I_m)D_{p,m}(I_p \otimes F_m)P_{n,p}$$

30

where  $P_{n,p}$  corresponds to DMA loading data with stride p to local PE memories.

In the next stage of operation all the PEs execute a local FFT of length m=n/p with local data. No communications between PEs is required.

$$F_n = (F_p \otimes I_m)D_{p,m}(I_p \otimes F_m)P_{n,p}$$

where  $(I_p \otimes F_m)$  specifies that all PEs execute a local FFT of length m sequentially, with local data.

In the next stage, all the PEs scale their local data by the twiddle factors and collectively execute m distributed FFTs of length p. This stage requires inter-PE communications.

$$F_n = (F_p \otimes I_m) D_{p,m} (I_p \otimes F_m) P_{n,p}$$

where  $(F_p \otimes I_n)D_{p,m}$  specifies that all PEs scale their local data by the twiddle factors and collectively execute multiple FFTs of length p on distributed data. In this final stage of the FFT computation, a relatively large number m of small distributed FFTs of size p must be calculated efficiently. The challenge is to completely overlap the necessary communications with the relatively simple computational requirements of the FFT.

The sequence of illustrations of Figs. 9A-9H outlines the ManArray distributed FFT algorithm using the indirect VLIW architecture, the multiply complex instructions, and operating on the 2x2 ManArray processor 100 of Fig. 1. The signal flow graph for the small FFT is shown in Fig. 8 and also shown in the right-hand-side of Figs. 9A-9H. In Fig. 8, the operation for a 4 point FFT is shown where each PE executes the operations shown on a horizontal row. The operations occur in parallel on each vertical time slice of operations as shown in the signal flow graph figures in Figs. 9A-9H. The VLIW code is displayed in a tabular form in Figs 9A-9H that corresponds to the structure of the ManArray architecture and the iVLIW instruction. The columns of the table correspond to the execution units available in the ManArray PE: Load Unit, Arithmetic Logic Unit (ALU), Multiply Accumulate Unit (MAU), Data Select Unit (DSU) and the Store Unit. The rows of the table can be interpreted as time steps representing the execution of different iVLIW lines.

The technique shown is a software pipeline implemented approach with iVLIWs. In Figs. 9A-91, the tables show the basic pipeline for PE3 155. Fig. 9A represents the input of the data X and its corresponding twiddle factor W by loading them from the PEs local memories, using the load indirect (Lii) instruction. Fig. 9B illustrates the complex arguments X and W which are multiplied using the MPYCX instruction 200, and Fig. 9C illustrates the communications operation between PEs, using a processing element exchange (PEXCHG) instruction. Further details of this instruction are found in U.S. Application Serial No. 09/169,256 entitled "Methods and Apparatus for ManArray PE-PE Switch Control" filed October 9, 1998. Fig. 9D illustrates the local and received quantities are added or subtracted (depending upon the processing element, where for PE3 a subtract (sub) instruction is used). Fig. 9E illustrates the result being multiplied by -i on PE3, using the MPYCX instruction. Fig. 9F illustrates another PE-to-PE communications operation where the previous product is exchanged between the PEs, using the PEXCHG instruction. Fig. 9G illustrates the local and received quantities are added or subtracted (depending upon the processing element, where for PE3 a subtract (sub) instruction is used). Fig. 9H illustrates the step where the results are stored to local memory, using a store indirect (sii) instruction.

The code for PEs 0, 1, and 2 is very similar, the two subtractions in the arithmetic logic unit in steps 9D and 9G are substituted by additions or subtractions in the other PEs as required by the algorithm displayed in the signal flow graphs. To achieve that capability and the distinct MPYCX operation in Fig. 9E shown in these figures, synchronous MIMD capability is required as described in greater detail in United States Patent Application Serial No. 09/187,539 filed November 6, 1998 and incorporated by reference herein in its entirety. By appropriate packing, a very tight software pipeline can be achieved as shown in Fig. 9I for this FFT example using only two VLIWs.

In the steady state, as can be seen in Fig. 91, the Load, ALU, MAU, and DSU units are fully utilized in the two VLIWs while the store unit is used half of the time. This high utilization rate using two VLIWs leads to very high performance. For example, a 256-point complex FFT can be accomplished in 425 cycles on a 2x2 ManArray.

As can be seen in the above example, this implementation accomplishes the following:

- An FFT butterfly of length 4 can be calculated and stored every two cycles, using four PEs.
- The communication requirement of the FFT is completely overlapped by the computational requirements of this algorithm.

- The communication is along the hypercube connections that are available as a subset
  of the connections available in the ManArray interconnection network.
- The steady state of this algorithm consists of only two VLIW lines (the source code is two VLIW lines long).
- All execution units except the Store unit are utilized all the time, which lead us to conclude that this implementation is optimal for this architecture.

#### Problem Size Discussion

The equation:

$$F_n = (F_p \otimes I_m) D_{p,m} (I_p \otimes F_m) P_{n,p}$$

where:

10

25

n is the length of the transform, p is the number of PEs, and

p is the number of 1 Es, at

is parameterized by the length of the transform n and the number of PEs, where m=n/p relates to the size of local memory needed by the PEs. For a given power-of-2 number of processing elements and a sufficient amount of available local PE memory, distributed FFTs of size p can be calculated on a ManArray processor since only hypercube connections are required. The hypercube of p or fewer nodes is a proper subset of the ManArray network. When p is a multiple of the number of processing elements, each PE emulates the operation of more than one virtual node. Therefore, any size of FFT problem can be handled using the above equation on any size of ManArray processor.

For direct execution, in other words, no emulation of virtual PEs, on a ManArray of size p, we need to provide a distributed FFT algorithm of equal size. For p=1, it is the sequential FFT. For p=2, the FFT of length 2 is the butterfly:

Y1=x0-w\*X1

where X0 and Y0 reside in or must be saved in the local memory of PE0 and X1 and Y1 on PE1, respectively. The VLIWs in PE0 and PE1 in a 1x2 ManArray processor (p=2) that are required for the calculation of multiple FFIs of length 2 are shown in Fig. 91 which shows that two FFI results are produced every two cycles using four VLIWs.

#### **Extending Complex Multiplication**

It is noted that in the two-cycle complex multiplication hardware described in Figs. 6 and 7, the addition and subtraction blocks 623, 625, 723, and 725 operate in the second execution cycle. By including the MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 instructions in the ManArray MAU, one of the execution units 10 oFig. 1, the complex multiplication operations can be extended. The ManArray MAU also supports multiply accumulate operations (MACs) as shown in Figs. 11A and 12A for use in general digital signal processing (DSP) applications. A multiply accumulate instruction (MPYA) 1100 as shown in Fig. 11A, and a sum two product accumulate instruction (SUM2PA) 1200 as shown in Fig. 12A, are defined as follows.

In the MPYA instruction 1100 of Fig. 11A, the product of source registers Rx and Ry is added to target register Rt. The word multiply form of this instruction multiplies two 32-bit values producing a 64-bit result which is added to a 64-bit odd/even target register. The dual halfword form of MPYA instruction 1100 multiplies two pairs of 16-bit values producing two 32-bit results: one is added to the odd 32-bit word, the other is added to the even 32-bit word of the odd/even target register pair. Syntax and operation details 1110 are shown in Fig. 11B. In the SUM2PA instruction 1200 of Fig. 12A, the product of the high halfwords of source registers Rx and Ry is added to the product of the low halfwords of Rx and Ry and the result is added to target register Rt and then stored in Rt. Syntax and operation details 1210 are shown in Fig. 12B.

Both MPYA and SUMP2A generate the accumulate result in the second cycle of the two-cycle pipeline operation. By merging MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 instructions with MPYA and SUMP2A instructions, the hardware supports the extension of the complex multiply operations with an accumulate operation. The mathematical operation is defined as:  $Z_T = Z_R + X_R Y_R - X_t Y_1 + i(Z_t + X_R Y_t + X_t Y_t)$ , where  $X = X_R + iX_t Y_t = Y_R + iY_t$  and i is an imaginary number, or the square root of negative one, with  $i^2 = -1$ . This complex multiply accumulate is calculated in a variety of contexts, and it has been recognized that it will be highly advantageous to perform this calculation faster and more efficiently.

For this purpose, an MPYCXA instruction 1300 (Fig. 13A), an MPYCXAD2 instruction 1400 (Fig. 14A), an MPYCXIA instruction 1500 (Fig. 15A), and an MPYCXJAD2 instruction 1600 (Fig. 16A) define the special hardware instructions that handle the multiplication with accumulate for complex numbers. The MPYCXA instruction

1300, for multiplication of complex numbers with accumulate is shown in Fig. 13. Utilizing this instruction, the accumulated complex product of fwo source operands is rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword HI contains the real component and halfword H0 contains the imaginary component. The MPYCXA instruction format is shown in Fig. 13A. The syntax and operation description 1310 is shown in Fig. 13B.

The MPYCXAD2 instruction 1400, for multiplication of complex numbers with accumulate, with the results divided by two is shown in Fig. 14A. Utilizing this instruction, the accumulated complex product of two source operands is divided by two, rounded according to the rounding mode specified in the instruction, and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXAD2 instruction format is shown in Fig. 14A. The syntax and operation description 1410 is shown in Fig. 14B.

The MPYCXJA instruction 1500, for multiplication of complex numbers with accumulate where the second argument is conjugated is shown in Fig. 15A. Utilizing this instruction, the accumulated complex product of the first source operand times the conjugate of the second source operand, is rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJA instruction format is shown in Fig. 15A. The syntax and operation description 1510 is shown in Fig. 15B.

The MPYCXJAD2 instruction 1600, for multiplication of complex numbers with accumulate where the second argument is conjugated, with the results divided by two is shown in Fig. 16A. Utilizing this instruction, the accumulated complex product of the first source operand times the conjugate of the second operand, is divided by two, rounded according to the rounding mode specified in the instruction and loaded into the target register. The complex numbers are organized in the source register such that halfword H1 contains the real component and halfword H0 contains the imaginary component. The MPYCXJAD2 instruction format is shown in Fig. 16A. The syntax and operation description 1610 is shown in Fig. 16B.

All instructions of the above instructions 1100, 1200, 1300, 1400, 1500 and 1600 complete in two cycles and are pipeline-able. That is, another operation can start executing on the execution unit after the first cycle. All complex multiplication instructions 1300, 1400, 1500 and 1600 return a word containing the real and imaginary part of the complex product in half words HI and HO respectively.

To preserve maximum accuracy, and provide flexibility to programmers, the same four rounding modes specified previously for MPYCX, MPYCXD2, MPYCXJ, and MPYCXJD2 are used in the extended complex multiplication with accumulate.

Hardware 1700 and 1800 for implementing the multiply complex with accumulate instructions is shown in Fig. 17 and Fig. 18, respectively. These figures illustrate the high level view of the hardware 1700 and 1800 appropriate for these instructions. The important changes to note between Fig. 17 and Fig. 6 and between Fig. 18 and Fig. 7 are in the second stage of the pipeline where the two-input adder blocks 623, 625, 723, and 725 are replaced with three-input adder blocks 1723, 1725, 1823, and 1825. Further, two new half word source operands are used as inputs to the operation. The Rt.H1 1731 (1831) and Rt.H0 1733 (1833) values are properly aligned and selected by multiplexers 1735 (1835) and 1737 (1837) as inputs to the new adders 1723 (1823) and 1725 (1825). For the appropriate alignment, Rt.H1 is shifted right by 1-bit and Rt.H0 is shifted left by 15-bits. The add/subtract, add/sub blocks 1723 (1823) and 1725 (1825), operate on the input data and generate the outputs as shown. The add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction. The results are rounded and bits 30-15 of both 32-bit results are selected 1727 (1827) and stored in the appropriate half word of the target register 1729 (1829) in the CRF. It is noted that the multiplexers 1735 (1835) and 1737 (1837) select the zero input, indicated by the ground sym

bol, for the non-accumulate versions of the complex multiplication series of instructions

While the present invention has been disclosed in the context of various aspects of presently preferred embodiments, it will be recognized that the invention may be suitably applied to other environments consistent with the claims which follow.

25

We claim:

5

15

25

 An apparatus for the efficient processing of complex multiplication computations, the apparatus comprising:

at least one controller sequence processor (SP);

- a memory for storing process control instructions;
  - a first multiply complex numbers instruction stored in the memory and operative to control the PEs to carry out a multiplication operation involving a pair of complex numbers; and

hardware for implementing the first multiply complex numbers instruction.

- The apparatus of claim 1 further comprising a plurality of processing elements (PEs) interconnected with said SP and arranged in an N x N array interconnected in a manifold array interconnection network.
- The apparatus of claim 1 wherein the first multiply complex instruction completes execution in 2 cycles.
- The apparatus of claim 1 wherein the first multiply complex instruction is tightly pipelineable.
- 5. The apparatus of claim 1 wherein each complex number is stored as a word, each word comprising a first half word and a second half word, with a real component of each complex number being stored as the first half word and an imaginary component of each complex number being stored as the second half word.
- 6. The apparatus of claim 1 wherein the first multiply complex instruction includes a plurality of rounding modes, the rounding modes including:

rounding toward a nearest integer;

rounding toward zero;

rounding toward infinity; and

rounding toward negative infinity.

- The apparatus of claim 1 wherein the first multiply complex numbers instruction
- is one of the following group of instructions: a multiply complex numbers (MPYCX), a
  multiply complex numbers instruction (MPYCXI) operative to carry out the multiplication of
  a pair of complex numbers where an argument is conjugated, a multiply complex numbers
  instruction (MPYCXD2) operative to carry out the multiplication of a pair of complex
  numbers with a result divided by two, and a multiply complex numbers instruction

25

(MPYCXJD2) operative to carry out the multiplication of a pair of complex numbers where an argument is conjugated with a result divided by two.

- The apparatus of claim 1 further comprising a multiply accumulate unit including the memory for storing the first multiply complex numbers instruction.
- The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a multiply accumulate instruction (MPYA) to extend a multiplication operation with an accumulate operation.
- 10. The apparatus of claim 8 wherein the multiply accumulate unit operates in response to a sum two product accumulate instruction (SUM2PA) to extend two multiplication operations with an accumulate operation.
  - 11. The apparatus of claim 9 wherein the multiply accumulate unit operates in response to a multiply complex with accumulate instruction (MPYCXA) to carry out the multiplication of a pair of complex numbers with accumulation of a third complex number.
  - The apparatus of claim 11 wherein the MPYCXA instruction completes execution in 2 cycles.
  - The apparatus of claim 12 wherein the MPYCXA instruction is tightly pipelineable.
  - 14. The apparatus of claim 1 further comprising one or more of the following additional instructions (MPYCXA, MPYCXAD2, MPYCXJA or MPYCXJAD2) stored in the memory to carry out complex multiplication operations pipelined in 2 cycles.
  - 15. A method for the computation of an FFT by a plurality of processing elements (PEs), the method comprising the steps of:

loading input data from a memory into each PE in a cyclic manner; calculating a local FFT by each PE;

- multiplying by the twiddle factors and calculating a FFT by the cluster of PEs; and loading the FFTs into the memory.
  - 16. A method for the computation of a distributed FFT by an  $N \times N$  processing element (PE) array, the method comprising the steps of:

loading a complex number x and a corresponding twiddle factor w from a memory

into each of the PEs;

calculating a first product by the multiplication of the complex numbers x and w; transmitting the first product from each of the PEs to another PE in the N x N array; receiving the first product and treating it as a second product in each of the PEs;

20

25

30

selectively adding or subtracting the first product and the second product to form a first result:

calculating a third product in selected PEs;

transmitting the first result or third product in selected PEs to another PE in the N x N  $^{5}$  array;

selectively adding or subtracting the received values to form a second result; and storing the second results in the memory.

17. A method for efficient computation by a 2 x 2 processing element (PE) array interconnected in a manifold array interconnection network, the array comprising four PEs (PE), PE1, PE2 and PE3), the method comprising the steps of:

loading a complex number x and a corresponding twiddle factor w from a memory into each of the four PEs, complex number x including subparts x0, x1, x2 and x3, twiddle factor w including subparts w0, w1, w2 and w3;

multiplying the complex numbers x and w, such that

PE0 multiplies x0 and w0 to produce a product0,

PEI multiplies x1 and w1 to produce a product1.

PE2 multiplies x2 and w2 to produce a product2, and

PE3 multiplies x3 and w3 to produce a product3;

transmitting the product0, the product1, the product2 and the product3, such that

PE0 transmits the product0 to PE2.

PE1 transmits the product1 to PE3.

PE2 transmits the product2 to PE0, and

PE3 transmits the product3 to PE1; and

performing arithmetic logic operations, such that

PE0 adds the product0 and the product2 to produce a sum t0,

PE1 adds the product1 and the product3 to produce a sum t2,

PE2 subtracts the product2 from the product0 to produce a sum t1, and

PE3 subtracts the product3 from the product1 to produce a result which is multiplied by -i to produce a sum t3.

The method of claim 17 further comprising the steps of :

transmitting the sums t0, t1, t2 and t3, such that

PE0 transmits t0 to PE1.

PE1 transmits t2 to PE0,

PE2 transmits t1 to PE3, and PE3 transmits t3 to PE2:

performing the arithmetic logic operations, such that

PE0 adds t0 and t2 to produce a v0.

PE1 subtracts t2 from t0 to produce a v1,

PE2 adds t1 and t3 to produce a y2, and

PE3 subtracts t3 from t1 to produce a v3; and

storing y0, y1, y2 and y3 in a memory.

- 19. A special hardware instruction for handling the multiplication with accumulate for two complex numbers from a source register whereby utilizing said instruction and accumulated complex product of two source operands is rounded according to a rounding mode specified in the instruction and loaded into a target register with the complex numbers organized in the source such that a halfword (H1) contains the real component and a halfword (H0) contains the imaginary component.
- 15 20. The special hardware instruction of claim 19 wherein the accumulated complex product is divided by two before it is rounded.
  - 21. An apparatus to efficiently fetch instructions including complex multiplication instructions and an accumulate form of multiplication instructions from a memory element and dispatch the fetched instruction to at least one of a plurality of multiply complex and multiply with accumulate execution units to carry out the instruction specified operation, the apparatus comprising:
    - a memory element;

means for fetching said instructions from the memory element;

- a plurality of multiply complex and multiply with accumulate execution units; and means to dispatch the fetched instruction to at least one of said plurality of execution units to carry out the instruction specified operation.
  - The apparatus of claim 21 further comprising:

an instruction register to hold a dispatched multiply complex instruction (MPYCX); means to decode the MPYCX instruction and control the execution of the MPYCX

30 instruction:

25

two source registers each holding a complex number as operand inputs to the multiply complex execution hardware;

four multiplication units to generate terms of the complex multiplication;

20

25

four pipeline registers to hold the multiplication results;

- an add function which adds two of the multiplication results from the pipeline registers for the imaginary component of the result:
- a subtract function which subtracts two of the multiplication results from the pipeline registers for the real component of the result;
  - a round and select unit to format the real and imaginary results; and
  - a result storage location for saving the final multiply complex result, whereby the apparatus is operative for the efficient processing of multiply complex computations.
  - The apparatus of claim 21 wherein the means for fetching said instructions is a sequence processor (SP) controller.
    - The apparatus of claim 22 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex divide by 2 instruction (MPYCXD2).
      - 25. The apparatus of claim 21 further comprising:
    - an instruction register to hold a dispatched multiply complex instruction (MPYCXJ);
      means to decode the MPYCXJ instruction and control the execution of the MPYCXJ
      instruction:

two source registers each holding a complex number as operand inputs to the multiply complex execution hardware;

four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results;

- an add function which adds two of the multiplication results from the pipeline registers for the real component of the result:
- a subtract function which subtracts two of the multiplication results from the pipeline registers for the imaginary component of the result;
  - a round and select unit to format the real and imaginary results; and
- a result storage location for saving the final multiply complex conjugate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate computations.
- 26. The apparatus of claim 25 wherein the round and select unit provides a shift or right as a divide by 2 operation for a multiply complex conjugate divide by 2 instruction (MPYCXJD2).
  - The apparatus of claim 21 further comprising:

30 and

WO 00/22503 PCT/US99/23494 21

an instruction register to hold the dispatched multiply accumulate instruction (MPYA);

means to decode the MPYA instruction and control the execution of the MPYA instruction:

- 5 two source registers each holding a source operand as inputs to the multiply accumulate execution hardware:
  - at least two multiplication units to generate two products of the multiplication;
  - at least two pipeline registers to hold the multiplication results; at least two accumulate operand inputs to the second pipeline stage accumulate
- hardware:
  - at least two add functions which each adds the results from the pipeline registers with the third accumulate operand creating two multiply accumulate results;
    - a round and select unit to format the results if required by the MPYA instruction; and
- a result storage location for saving the final multiply accumulate result, whereby the apparatus is operative for the efficient processing of multiply accumulate computations.
  - The apparatus of claim 21 further comprising:
  - an instruction register to hold a dispatched multiply accumulate instruction (SUM2PA);
- means to decode the SUM2PA instruction and control the execution of the SUM2PA instruction: 20
  - at least two source registers each holding a source operand as inputs to the SUM2PA execution hardware:
    - at least two multiplication units to generate two products of the multiplication;
    - at least two pipeline registers to hold the multiplication results;
- at least one accumulate operand input to the second pipeline stage accumulate 25 hardware:
  - at least one add function which adds the results from the pipeline registers with the third accumulate operand creating a SUM2PA result;
  - a round and select unit to format the results if required by the SUM2PA instruction;
    - a result storage location for saving the final result, whereby the apparatus is operative for the efficient processing of sum of 2 products accumulate computations
      - 29. The apparatus of claim 21 further comprising:

an instruction register to hold the dispatched multiply complex accumulate instruction (MPYCXA):

means to decode the MPYCXA instruction and control the execution of the MPYCXA instruction:

5 two source registers each holding a complex number as operand inputs to the multiply complex accumulate execution hardware:

> four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results;

at least two accumulate operand inputs to the second pipeline stage accumulate

10 hardware;

20

25

an add function which adds two of the multiplication results from the pipeline registers and also adds one of the accumulate operand input for the imaginary component of the result:

a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the real component of the result;

a round and select unit to format the real and imaginary results; and
a result storage location for saving the final multiply complex accumulate result,
whereby the apparatus is operative for the efficient processing of multiply complex
accumulate computations.

- 30. The apparatus of claim 29 wherein the round and select unit provides a shift right as a divide by 2 operation for a multiply complex accumulate divide by 2 instruction (MPYCXAD2).
  - 31. The apparatus of claim 21 further comprising:

an instruction register to hold the dispatched multiply complex conjugate accumulate instruction (MPYCXJA);

means to decode the MPYCXJA instruction and control the execution of the MPYCXJA instruction;

two source registers each holding a complex number as operand inputs to the multiply

complex accumulate execution hardware;

four multiplication units to generate terms of the complex multiplication; four pipeline registers to hold the multiplication results;

20

25

at least two accumulate operand inputs to the second pipeline stage accumulate hardware:

- an add function which adds two of the multiplication results from the pipeline
  registers and also adds one of the accumulate operand input for the real component of the
  result:
  - a subtract function which subtracts two of the multiplication results from the pipeline registers and also adds the other accumulate operand input for the imaginary component of the result;
    - a round and select unit to format the real and imaginary results; and
  - a result storage location for saving the final multiply complex conjugate accumulate result, whereby the apparatus is operative for the efficient processing of multiply complex conjugate accumulate computations.
- The apparatus of claim 31 wherein the round and select unit provides a shift
  right as a divide by 2 operation for a multiply complex conjugate accumulate divide by 2
  instruction (MPYCXJAD2).
  - 33. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXJA, MPYCXAD2, MPYCXJAD2 instructions, and all of these instructions complete execution in 2 cycles.
  - 34. The apparatus of claim 21 wherein the complex multiplication instructions and accumulate form of multiplication instructions include MPYCX, MPYCXD2, MPYCXJ, MPYCXJD2, MPYCXJA2, MPYCXJA2, MPYCXJAD2 instructions, and all of these instructions are tightly pipelineable.
    - An apparatus for the efficient processing of an FFT, the apparatus comprising:
       at least one controller sequence processor (SP);
  - a plurality of processing elements (PEs) arranged in an NxN array interconnected in a manifold (ManArray) interconnection network; and
    - a memory for storing instructions to be processed by the SP and by the array of PEs.
- 36. The apparatus of claim 22 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.

WO 00/22503 PCT/US99/23494 24

The apparatus of claim 25 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.

- 38. The apparatus of claim 29 wherein the add function and subtract function are 5 selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.
  - The apparatus of claim 31 wherein the add function and subtract function are selectively controlled functions allowing either addition or subtraction operations as specified by the instruction.



N. N

MPYCX-MULTIPLY COMPLEX ENCODING

|          | ,                                                                               | $\overline{}$ |
|----------|---------------------------------------------------------------------------------|---------------|
|          | 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 | CE3           |
|          | 2                                                                               | S             |
|          | 9                                                                               |               |
|          | 7                                                                               |               |
|          | 8                                                                               | €             |
|          | 6                                                                               | _             |
|          | 9                                                                               |               |
|          | Ξ                                                                               |               |
|          | 3 12 1                                                                          |               |
|          | 13                                                                              | ž             |
|          | 14                                                                              |               |
|          | 15                                                                              |               |
|          | 16                                                                              |               |
|          | 17                                                                              |               |
|          | 18                                                                              | -             |
|          | 19                                                                              | 꿆             |
|          | 30 29 28 27 26 25 24 23 22 21 20 19                                             |               |
|          | 21                                                                              |               |
|          | 22                                                                              | MAUopcode     |
|          | 23                                                                              | 8             |
|          | 24                                                                              | 힑             |
|          | 25                                                                              | ≨             |
|          | 28                                                                              |               |
|          | 27                                                                              | Н             |
|          | 28                                                                              | 3             |
| ENCODING | 29                                                                              | GROUP S/P     |
| Š        | 33                                                                              | 5             |
| 2        | 31                                                                              | 8             |
| _        |                                                                                 |               |
|          |                                                                                 |               |

FIG. 2A

Syntax/Operation

Dual Halfwords Do operation only if T/F condition is satisfied in ACFs Rt.H0  $\leftarrow$  round ((Rx.H1 \* Ry.H0 + Rx.H0 \* Ry.H1)/2 Rt.H1  $\leftarrow$  round ((Rx.H1 \* Ry.H1 - Rx.H0 \* Ry.H0)/2." Do operation below but do not affect ACFs OPERATION OPERANDS RoundMode RoundMode RoundMode Rt, Rx, Ry Rt, Rx, Ry Rt, Rx, Ry MPYCX.[CNVZ].[SP]M.2SH TFJ.MPYCX, ISPIM.2SH MPYCX.[SP]M.2SH INSTRUCTION

2/23

None

None 요 ī

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING: R-ROUND FOR ROUNDING TO THE NEAREST INTEGER V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER OPERATION PRIOR TO THE RECEIL FOR ROUNDING TOWARDS POSITIVE INFINITY R=TRUNC FOR ROUNDING TOWARDS ZERO

Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE

N=MSB OF RESULT

DIVIDE, 0 OTHERWISE

C=NOT AFFECTED

FIG. 2B

CYCLES:

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY R=ROUND FOR ROUNDING TO THE NEAREST INTEGER

MPYCXD2-MULTIPLY COMPLEX DIVIDE BY 2 ENCODING

FIG. 3A

Syntax/Operation

Dual Halfwords None lone Ξ 요 Do operation only if T/F condition is satisfied in ACFs Rt.H0  $\leftarrow$  round ((Rx.H1 \* Ry.H0 + Rx.H0 \* Ry.H1)/2 Rt.H1  $\leftarrow$  round ((Rx.H1 \* Ry.H1 - Rx.H0 \* Rx.H0)/2 Do operation below but do not affect ACFs OPERATION OPERANDS RoundMode RoundMode Rt, Rx, Ry Rt, Rx, Ry MPYCXDZ[CNVZ].[SP]M.2SH Rt, Rx, Ry TFJ.MPYCXD2.[SPJM.2SH MPYCXD2.[SP]M.2SH INSTRUCTION

3/23

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING: R=TRUNC FOR ROUNDING TOWARDS ZERO N=MSB OF RESULT

RoundMode

V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER OPERATION PRIOR TO THE R-CEIL FOR ROUNDING TOWARDS POSITIVE INFINITY Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE DIVIDE, 0 OTHERWISE

C=NOT AFFECTED

CYCLES:

R=ROUND FOR ROUNDING TO THE NEAREST INTEGER

MPYCXJ-MULTIPLY COMPLEX CONJUGATE ENCODING

EG.

Synfax/Operation

| Office                            |                         |                                                                   |                |      |
|-----------------------------------|-------------------------|-------------------------------------------------------------------|----------------|------|
| INSTRUCTION                       | OPERANDS OPERATION      | OPERATION                                                         | ACF            |      |
|                                   |                         | Dual                                                              | Dual Halfwords |      |
| MPYCXJ.[SP]M.2SH                  | Rt, Rx, Ry<br>RoundMode | Do operation below but do not affect ACFs                         | None           | 4/23 |
| MPYCXJ[CNVZ],[SP]M.2SH Rt, Rx, Ry | Rt, Rx, Ry              | $Rt.H1 \leftarrow round ((Rx.H1 * Ry.H1 + Rx.H0 * Ry.H0)/2^{15})$ | F              |      |
|                                   | RoundMode               | $RtH0 \leftarrow round ((Rx.H0 * Ry.H1 - Rx.H1 * Ry.H0)/2^{15})$  | 2              |      |
| [TF].MPYCXJ.[SP]M.2SH             | Rt, Rx, Ry              | Do operation only if T/F condition is satisfied in ACFs           | None           |      |
|                                   | RoundMode               |                                                                   |                |      |

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING: N=MSB OF RESULT

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY R=CEIL FOR ROUNDING TOWARDS POSITIVE INFINITY R=TRUNC FOR ROUNDING TOWARDS ZERO V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER OPERATION, 0 OTHERWISE Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE C=NOT AFFECTED

CYCLES: 2

FIG. 4B

MPYCXJD2-MULTIPLY COMPLEX CONJUGATE DIVIDE BY 2

ENCODING

7 00s

FIG. 5A

Syntax/Operation

| ,        |                    |                       | 5/2                                       | 23        |                                                                |                                                             |                                                         |           |
|----------|--------------------|-----------------------|-------------------------------------------|-----------|----------------------------------------------------------------|-------------------------------------------------------------|---------------------------------------------------------|-----------|
|          | ACF                | <b>Dual Halfwords</b> | None                                      |           | Ħ                                                              | 9                                                           | None                                                    |           |
|          |                    | Dual                  |                                           |           |                                                                |                                                             |                                                         |           |
|          | OPERATION          |                       | Do operation below but do not affect ACFs |           | Rt.H1 ← round ((Rx.H1 * Ry.H1 + Rx.H0 * Ry.H0)/2 <sup>16</sup> | $RtH0 \leftarrow round ((Rx.H0 * Ry.H1 - Rx.H1 * Ry.H0)/2)$ | Do operation only if T/F condition is satisfied in ACFs |           |
|          | OPERANDS OPERATION |                       | Rt, Rx, Ry                                | RoundMode | Rt, Rx, Ry                                                     | RoundMode                                                   | Rt, Rx, Ry                                              | RoundMode |
| un<br>ou | INSTRUCTION        |                       | MPYCXJD2.[SP]M.2SH                        | 1         | MPYCXJDZ[CNVZ].[SP]M.2SH Rt, Rx, Ry                            |                                                             | [TF]:MPYCXJD2.[SP]M.2SH Rt, Rx, Ry                      |           |

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING:

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE

V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER OPERATION, 0 OTHERWISE | R=CEIL FOR ROUNDING TOWARDS POSITIVE INFINITY N=MSB OF RESULT C=NOT AFFECTED

R-ROUND FOR ROUNDING TO THE NEAREST INTEGER

CYCLES:

FIG. 5B





8/23



FIG. 8

THE COMPLEX ARGUMENTS X AND W ARE MULTIPLIED: INPUT OF THE DATA X AND ITS CORRESPONDING TWIDDLE FACTOR W:



<u>Б</u>

A COMMUNICATIONS OPERATIONS BETWEEN PES:

|                                                                                                   |       |         |              |             |           | 10 | /2 | 3 |  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                  | 0                 |
|---------------------------------------------------------------------------------------------------|-------|---------|--------------|-------------|-----------|----|----|---|--|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|-------------------|
| DDED OR<br>SSING                                                                                  | STORE |         |              |             |           |    |    |   |  | FIG. 9D                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                  | <b>√</b>          |
| tities are ai<br>The proces                                                                       | DSO   |         |              | pexchg.pd.w |           |    |    |   |  | √₀ <b>FIG</b>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | <b>√</b> 2       | × <sup>ε</sup>    |
| THE LOCAL AND RECEIVED QUANTITIES ARE ADDED OR SUBTRACTED (DEPENDING UPON THE PROCESSING ELEMENT. | MAU   |         | mpycx.pm.2sh |             |           |    |    |   |  | 0 0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 0                |                   |
| CAL AND REC<br>ACTED (DEPE<br>NT.                                                                 | ) ALU |         |              |             | sub.pa.2h |    |    |   |  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                  |                   |
| THE LO<br>SUBTR<br>ELEME                                                                          | LOAD  | iii p.w |              |             |           | L  |    |   |  | × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × × | <br>×<br>×       |                   |
| TWEEN                                                                                             | STORE |         |              |             |           | Γ  |    |   |  | <sup>7</sup> ₀ FIG. 9C                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |                  | 930               |
| ATIONS BE                                                                                         | DSO   |         |              | w.pd.bd.w   |           |    |    |   |  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | > Y <sub>2</sub> | × 3               |
| A COMMUNICATIONS OPERATIONS BETWEEN PES:                                                          | MAU   |         | mpycx.pm.2sh |             |           |    |    |   |  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | (*)              |                   |
| MMUNICA                                                                                           | ALU   |         |              |             |           |    |    |   |  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                  |                   |
| A CO<br>PEs:                                                                                      | LOAD  | ii.b.w  |              |             |           |    |    |   |  | \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | .,               | ٠. <sub>٤</sub> > |

THE RESULT IS MULTIPLIED BY -i ON PE3:

ANOTHER SET OF PE-TO-PE COMMUNICATIONS WHERE THE PREVIOUS PRODUCT IS EXCHANGED BETWEEN

|          |         |         |              |   |             |           |    |             | 11 | /2          | 3 |   |  |   |          |               |               |          |               |          |         | 0              |
|----------|---------|---------|--------------|---|-------------|-----------|----|-------------|----|-------------|---|---|--|---|----------|---------------|---------------|----------|---------------|----------|---------|----------------|
| STORE    |         |         |              |   |             |           |    |             |    |             |   |   |  |   |          | FIG. 9F       |               |          |               |          | ¥       | 960            |
| nsa      |         |         |              |   | w.bd.ghoxed |           |    |             |    | w.bd.ghoxed |   |   |  |   |          | <b>∃</b><br>∘ | >             | <u>-</u> | >             | 12       | >       | <sub>ლ</sub> _ |
| MAU      |         |         | mpycx.pm.2sh |   |             |           |    | mpycx.pm.2h |    |             |   |   |  |   |          |               | <u>(</u>      |          | _(            | -)       |         |                |
| ALU      |         |         |              |   |             | sub.pa.2h |    |             |    |             |   |   |  |   | (        |               | > (-          |          | \$            | )<br>(-) | \<br>\{ | 5              |
| THE PES: | M.D.W   | III.D.W |              |   |             |           |    |             |    |             |   |   |  | > | چ        | × 0 × 0       | <u>ج</u><br>ج | × ×      | ر<br>کن<br>کن | × K      | ي<br>م  | MEW.           |
| _        |         |         |              | _ | ,           |           |    | ,           |    |             |   | _ |  | 1 |          |               |               |          |               |          |         | 950            |
| STORE    |         |         |              |   |             |           |    |             |    |             |   |   |  |   |          | PIG. 9E       |               |          |               |          | ~       |                |
| DSO      |         |         |              |   | pexchg.pd.w |           |    |             |    |             |   |   |  |   | 1        | `             | >             | -        | >             | 7 72     | >       | - 3            |
| MAU      |         |         | mpycx.pm.2sh |   |             |           |    | mpycx.pm.2h |    |             |   |   |  |   | (        |               |               |          | )<br>}_       |          | 5       |                |
| ALU      |         |         |              |   |             | sub.pa.2h |    |             |    |             |   |   |  |   | ( )<br>F |               | ,<br>(        |          |               |          | 188     | .) KEM         |
| LOAD     | lii.p.w | lii.p.w |              |   |             |           | L. |             |    |             |   |   |  | , | ٠.<br>م  | >             | ٠ <u>.</u>    | . 57     | ×             | . ×:     | ×       | , <b>&gt;</b>  |

STORE

THE RESULTS ARE STORED TO LOCAL MEMORY: THE LOCAL AND RECEIVED QUANTITIES ARE

|                                                               | ادما     |         |         |              |             |           |             |             |           | 65 | <br>_ |   |
|---------------------------------------------------------------|----------|---------|---------|--------------|-------------|-----------|-------------|-------------|-----------|----|-------|---|
|                                                               | DSO      |         |         |              | pexchg.pd.w |           |             | w.bd.ghoxed |           |    |       |   |
|                                                               | MAU      |         |         | mpycx.pm.2sh |             |           | mpycx.pm.2h |             |           |    |       |   |
| <u>:</u>                                                      | LOAD ALU |         |         |              |             | sub.pa.2h |             |             | sub.pa.2h |    |       |   |
| . LAOIMEIN                                                    | LOAD     | W.d.ii  | W.d.ii  |              |             |           |             |             |           |    |       | × |
| 00 ON                                                         | STORE    |         |         |              |             |           |             |             |           |    |       |   |
| DEPENDIN<br>NT);                                              | DSO      |         |         |              | m.pd.ghoxed |           |             | pexchg.pd.w |           |    |       |   |
| ADDED OK SUBIRACIED (DEPENDING ON<br>THE PROCESSING ELEMENT); | MAU      |         |         | mpycx.pm.2sh |             |           | mpycx.pm.2h |             |           |    |       |   |
| ROCESS                                                        | LOAD ALU |         |         |              |             | sub.pa.2h |             |             | sub.pa.2h |    |       | × |
| THEF                                                          | LOAD     | W.d.iil | W.d.iii |              |             |           |             |             |           |    |       | _ |

12/23

sii.p.w

Picker Company of the company of the



13/23

| STORE |                  |                  |                       |                  |                             |                    |                             |                      |                             |                            |                             | sii.p.w r7,a2+,1           |                             | sii.p.w r7.a2+,1           |
|-------|------------------|------------------|-----------------------|------------------|-----------------------------|--------------------|-----------------------------|----------------------|-----------------------------|----------------------------|-----------------------------|----------------------------|-----------------------------|----------------------------|
| DSO   |                  |                  |                       |                  | pexchg.pd.w r3,r2,2x2_north |                    | pexchg.pd.w r3,r2,2x2_north |                      | pexchg.pd.w r3,r2,2x2_north | pexchg.pd.w r6,r5,2x2_east | pexchg.pd.w r3,r2,2x2_north | pexchg.pd.w r6,r5,2x2_east | pexchg.pd.w r3,r2,2x2_north | pexchg.pd.w r6,r5,2x2_east |
| MAU   |                  |                  | mpycx.pm.2sh r2,r0,r1 |                  | mpycx.pm.2sh r2,r0,r1       |                    | mpycx.pm.2sh r2,r0,r1       | mpycx.pm.2h r5,r4,r8 | mpycx.pm.2sh r2,r0,r1       | mpycx.pm.2h r5,r4,r8       | mpycx.pm.2sh r2,r0,r1       | mpycx.pm.2h r5,r4,r8       | mpycx.pm.2sh r2,r0,r1       | mpycx.pm.2h r5,r4,r8       |
| ALU   |                  |                  |                       |                  |                             | sub.pa.2h r4,r2,r3 |                             | sub.pa.2h r4,r2,r3   |                             | sub.pa.2h r4,r2,r3         | sub.pa.2h r7,r5,r6          | sub.pa.2h r4,r2,r3         | sub.pa.2h r7,r5,r6          | sub.pa.2h r4,r2,r3         |
| LOAD  | lii.p.w R0,a0=,1 | lii.p.w R1,a1=,1 | lii.p.w R0,a0=,1      | lii.p.w R1,a1=,1 | lii.p.w R0,a0=,1            | lii.p.w R1,a1=,1   | lii.p.w R0,a0=,1            | lii.p.w R1,a1=,1     | lii.p.w R0,a0=,1            | lii.p.w R1,a1=,1           | 1 lii.p.w R0,a0=,1          | 2 Hii.p.w R1,a1=,1         | 1 lii.p.w R0,a0=,1          | 2 lii.p.w R1,a1=,1         |
| ≥     |                  |                  |                       |                  |                             |                    |                             |                      |                             |                            | •                           | . 4                        | ,                           |                            |

FIG. 91

# 14/23

| LOAD ALU   MAU DSU STORE                                                                                                                                                                   | 3 |                  |                     |     |                    |                  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|------------------|---------------------|-----|--------------------|------------------|
| add.pa.4h r6,r4,r12   pexchg.pd.w r4,r2   pexchg.pd.w r4,r2   pexchg.pd.w r5,r3   add.pa.4h r6,r4,r2   pexchg.pd.w r4,r12     pexchg.pd.w r4,r12   pexchg.pd.w r5,r13   pexchg.pd.w r5,r13 |   | LOAD             | ALU                 | MAU | nsa                | STORE            |
| i.pd r2,a0+,1                                                                                                                                                                              | 1 |                  | add.pa.4h r6,r4,r12 |     | pexchg.pd.w r4,r2  |                  |
| add.pa.4h r6,r4,r2 pexchg.pd.w r4,r12                                                                                                                                                      | 2 |                  |                     |     | pexchg.pd.w r5,r3  | sii.p.d r6,a1+,1 |
| pexchg.pd.w r5,r13                                                                                                                                                                         | က |                  | add.pa.4h r6,r4,r2  |     | pexchg.pd.w r4,r12 |                  |
|                                                                                                                                                                                            | 4 | lii.p.d r2,a0+,1 |                     |     | pexchg.pd.w r5,r13 | sii.p.d r6,a1+,1 |

|   | STORE |                                            | sli.p.d r6,a1+,1                          |                                        | sii.p.d r6,a1+,1      |
|---|-------|--------------------------------------------|-------------------------------------------|----------------------------------------|-----------------------|
|   | DSO   | pexchg.pd.w r4,r2                          | pexchg.pd.w r5,r3                         | pexchg.pd.w r4,r12                     | pexchg.pd.w r5,r13    |
|   | MAU   | mpycx.pm.2sh r12,r0,r18 pexchg.pd.w r4,r2  | mpycx.pm.2sh r13,r1,r19 pexchg.pd.w r5,r3 | mpycx.pm.2sh r2,r0,r8                  | mpycx.pm.2sh r3,r1,r9 |
|   | ALU   | 1   III.p.d r8,a2+,1   sub.pa.4h r6,r4,r12 |                                           | 3 lii.p.d r18,a2+,1 sub.pa.4h r6,r4,r2 |                       |
|   | LOAD  | lii.p.d r8,a2+,1                           | 2 lii.p.d r0,a0+,1                        | lii.p.d r18,a2+,1                      | 4 lii.p.d r0,a0+,1    |
| j |       | -                                          | 2                                         | က                                      | 4                     |

FIG. 9J

## 15/23

$$y = (I_{m} \otimes A) \times$$
Example:  $y = (I_{4} \otimes A) \times_{8}$ 

$$\begin{bmatrix} x_{0} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \otimes \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \end{bmatrix} = \begin{bmatrix} a & b & 0 & 0 & 0 & 0 & 0 & 0 \\ c & d & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & a & b & 0 & 0 & 0 \\ 0 & 0 & 0 & c & d & 0 & 0 \\ 0 & 0 & 0 & 0 & c & d & 0 \\ 0 & 0 & 0 & 0 & 0 & c & d \end{bmatrix} \begin{bmatrix} x_{0} \\ x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \\ x_{5} \\ x_{6} \\ x_{7} \end{bmatrix} = \begin{bmatrix} ax_{0} + bx_{1} \\ cx_{0} + dx_{1} \\ cx_{0} + dx_{1} \\ ax_{2} + bx_{3} \\ cx_{2} + dx_{3} \\ ax_{4} + bx_{5} \\ cx_{4} + dx_{5} \\ ax_{6} + bx_{7} \\ cx_{6} + dx_{7} \end{bmatrix}$$

**FIG. 10A** 

$$y = (A \otimes I_{m}) \times$$
**Example:**  $y = (A \otimes I_{4}) \times_{8}$ 

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \otimes \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_{0} \\ X_{1} \\ X_{2} \\ X_{3} \\ X_{4} \\ X_{5} \\ X_{6} \\ X_{7} \end{bmatrix} = \begin{bmatrix} a & 0 & 0 & 0 & b & 0 & 0 & 0 \\ 0 & a & 0 & 0 & b & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & b & 0 \\ 0 & 0 & 0 & 0 & 0 & b & 0 \\ 0 & 0 & 0 & 0 & 0 & b & 0 \\ 0 & 0 & 0 & 0 & 0 & b & 0 \\ 0 & 0 & 0 & 0 & 0 & b & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & b \\ 0 & 0 & 0 & 0 & 0 & 0 & b \\ 0 & 0 & 0 & 0 & 0 & 0 & d \\ 0 & 0 & 0 & 0 & 0 & d & 0 \\ 0 & 0 & 0 & 0 & 0 & d & 0 \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\ 0 & 0 & 0 & 0 & d & d \\$$

**FIG. 10B** 

MPYA-MULTIPLY ACCUMULATE

1100

ENCODING

FIG. 11A

Syntax/Operation

Dual Halfwords None Nord None None Ą 잂 Ŧ Do operation only if T/F condition is satisfied in ACFs Do operation below but do not affect ACFs Do operation below but do not affect ACFs RtollRte  $\leftarrow$  (Rx \* Rv) + RtollRte Rto  $\leftarrow$  (Rx.H1 \* Ry.H1) + Rto OPERATION OPERANDS MPYAICNVZI.ISPIM.1ISUIW | Rte, Rx, Ry MPYA[CNVZ].[SP]M.1[SU]H [TF].MPYA.[SP]M.1[SU]W MPYA.ISPIM.1ISUIW MPYA.[SP]M.2[SU]H NSTRUCTION

16/23

ARITHMETIC FLAGS AFFECTED

N=MSB OF RESULT Z=1 IF RESULT IS ZERO, 0 OTHERWISE

V=1 IF AN OVERFLOW OCCURS ON THE ADDITION, 0 OTHERWISE C=1 IF AN CARRY OCCURS ON THE ADDITION, 0 OTHERWISE

FIG. 11B

None

Do operation only if T/F condition is satisfied in ACFs

Rte, Rx, Ry

[TF].MPYA.ISPJM.ZISUJH

Rte ← (Rx.H0 \* Ry.H0) + Rte

요

FIG. 12A

SUM2PA - SUM OF 2 PRODUCTS ACCUMULATE ENCODING

Syntax/Operation

| INSTRUCTION                          | OPERANDS OPERATION | OPERATION                                                                                      | ACF                   |   |
|--------------------------------------|--------------------|------------------------------------------------------------------------------------------------|-----------------------|---|
|                                      |                    | a                                                                                              | <b>Dual Halfwords</b> | Г |
| SUMZPA.[SP]M.2[SU]H                  | Rt, Rx, Ry         | Do operation below but do not affect ACFs                                                      | None                  |   |
| SUMZPAJCNVZJ.[SP]M.2[SU]H Rt, Rx, Ry | Rt, Rx, Ry         | Rt ← Rt + (Rx.H1 * Ry.H1) + (Rx.H0 * Ry.H0)                                                    | Ю.                    |   |
| TTFJ.SUMZPA.[SPJM.2[SUJH Rt, Rx, Ry  | Rt, Rx, Ry         | Do operation only if T/F condition is satisfied in ACFs                                        | None                  | Г |
|                                      |                    | ď                                                                                              | Quad Halfwords        |   |
| SUM2PA.[SP]M.4[SU]H                  | Rte, Rxe, Rye      | Rte, Rxe, Rye Do operation below but do not affect ACFs                                        | None                  |   |
| SUM2PA[CNVZ].[SP]M.4[SU]H            | Rte, Rxe, Rye      | SUM2PA[CNVZ];SP]M.4[SU]H Rte, Rxe, Rye Rto ← Rto+ (Rxo.H1 * Ryo.H1) + (Rxo.H0 * Ryo.H0)        | 됴                     | Γ |
|                                      |                    | Rte ← Rte+ (Rxe.H1 * Rye.H1) + (Rxe.H0 * Rye.H0)                                               | 요                     |   |
| [TF].SUM2PA.[SP]M.4[SU]H             | Rte, Rxe, Rye      | [TF].SUM2PA.[SP]M.4[SU]H Rte, Rxe, Rye Do operation only if T/F condition is satisfied in ACFs | None                  | Γ |

17/23

ARITHMETIC SCALAR FLAGS AFFECTED (on least significant operation)

N=MSB OF RESULT

Z=1 IF RESULT IS ZERO, 0 OTHERWISE V=1 IF AN OVERFLOW OCCURS ON THE ADD WITH Rt, 0 OTHERWISE C=1 IF AN CARRY OCCURS ON THE ADD WITH Rt,, 0 OTHERWISE

FIG. 12B

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY

R-ROUND FOR ROUNDING TO THE NEAREST INTEGER

MPYCXA-MULTIPLY COMPLEX ACCUMULATE ENCODING

FIG. 13A

Syntax/Operation

330

18/23 Dual Halfwords None 五 Rt.H0  $\leftarrow$  round ((Rt.H0+ Rx.H1 \* Ry.H0 + Rx.H0 \* Ry.H1)/2  $^{15}$ RtH1  $\leftarrow$  round ((RtH1+ Rx.H1 \* Ry.H1 - Rx.H0 \* Ry.H0)/2 Do operation only if T/F condition is satisfied in ACFs Do operation below but do not affect ACFs OPERATION OPERANDS RoundMode RoundMode RoundMode Rt, Rx, Ry Rt, Rx, Ry MPYCXA[CNVZ].[SP]M.2SH Rt, Rx, Ry [TF].MPYCXA.ISP]M.2SH MPYCXA.[SP]M.2SH NSTRUCTION

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING: R=TRUNC FOR ROUNDING TOWARDS ZERO N=MSB OF RESULT

v=1 if an integer overflow occurs on either operation prior to the | R=Ceil for rounding towards positive infinity Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE

DIVIDE, 0 OTHERWISE C=NOT AFFECTED

R=CEIL FOR ROUNDING TOWARDS POSITIVE INFINITY R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY

R=TRUNC FOR ROUNDING TOWARDS ZERO

R-ROUND FOR ROUNDING TO THE NEAREST INTEGER

MPYCXAD2-MULTIPLY COMPLEX ACCUMULATE DIVIDE BY 2 ENCODING

FIG. 14A

Syntax/Operation

| INSTRUCTION                                  | OPERANDS OPERATION      | OPERATION                                                                          | ACF                   | _     |
|----------------------------------------------|-------------------------|------------------------------------------------------------------------------------|-----------------------|-------|
|                                              |                         | Dual F                                                                             | <b>Dual Halfwords</b> |       |
| MPYCXAD2.[SP]M.2SH                           | Rt, Rx, Ry<br>RoundMode | Do operation below but do not affect ACFs                                          | None                  | 19/23 |
| MPYCXADZ[CNVZ].[SP]M.2SH Rt, Rx, Ry          | Rt, Rx, Ry              | $RtH1 \leftarrow round ((RtH1 + RxH1 * RyH1 - RxH0 * RyH0))2^{16}$                 | F                     |       |
|                                              | RoundMode               | Rt.H0 $\leftarrow$ round ((Rt.H0+ Rx.H1 * Ry.H0 + Rx.H0 * Ry.H1)/2 <sup>16</sup> ) | 요                     |       |
| [TF].MPYCXADZ.[SP]M.2SH Rt, Rx, Ry RoundMode | Rt, Rx, Ry<br>RoundMode | Do operation only if T/F condition is satisfied in ACFs                            | None                  |       |
|                                              |                         |                                                                                    |                       | _     |

40/22

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND RoundMode IS ONE OF THE FOLLOWING: N=MSB OF RESULT

V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER OPERATION PRIOR TO THE Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE DIVIDE, 0 OTHERWISE

C=NOT AFFECTED

MPYCXJ-MULTIPLY COMPLEX CONJUGATE ACCUMULATE ENCODING

1500

FIG. 15A

Syntax/Operation

20/23 **Dual Halfwords** ACF None None 요 五 Rt.H1  $\leftarrow$  round ((Rt.H1+ Rx.H1 \* Ry.H1 + Rx.H0 \* Ry.H0)/2 " RtH0  $\leftarrow$  round ((RtH0+ Rx.H0 \* Ry.H1 - Rx.H1 \* Ry.H0/2" Do operation only if T/F condition is satisfied in ACFs Do operation below but do not affect ACFs OPERATION OPERANDS RoundMode RoundMode RoundMode Rt, Rx, Ry Rt, Rx, Ry Rt, Rx, Ry MPYCXJA[CNVZ].[SPJM.2SH [TF].MPYCXJA.[SP]M.2SH MPYCXJA.[SP]M.2SH NSTRUCTION

R=FLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) | THE OPERAND ROUNDMODE IS ONE OF THE FOLLOWING: R=CEIL FOR ROUNDING TOWARDS POSITIVE INFINITY R=TRUNC FOR ROUNDING TOWARDS ZERO V=1 IF AN INTEGER OVERFLOW OCCURS ON EITHER, 0 OTHERWISE Z=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE

N=MSB OF RESULT

C=NOT AFFECTED

CYCLES:

R=ROUND FOR ROUNDING TO THE NEAREST INTEGER 15B MPYCXJD2-MULTIPLY COMPLEX CONJUGATE ACCUMULATE DIVIDE BY 2

ENCODING

31 |30 |29 |28 |27 |26 |25 |24 |23 |22 |21 |20 |19 |18 |17 |16 |15 |14 |13 |12 |11 |10 |9 |8 |7 |6 |5 |4 |3 |2 |1 |0 O RM 8 æ ž ₩ MAUopcode SROUP S/P UNIT

FIG. 16A

Syntax/Operation

1610

Dual Halfwords None Sue 요 五 RtH1 ← round ((RtH1+ RxH1 \* RyH1 + RxH0 \* RyH0)/2 Rt.H0  $\leftarrow$  round ((Rt.H0+ Rx.H0 \* Ry.H1 - Rx.H1 \* Ry.H0/2  $\dot{}$ Do operation only if T/F condition is satisfied in ACFs Do operation below but do not affect ACFs OPERANDS OPERATION RoundMode RoundMode RoundMode Rt, Rx, Ry R. R. R. MPYCXJADZ[CNVZ].[SP]M.2SH|Rt, Rx, Ry [TF].MPYCXJAD2.[SP]M.2SH MPYCXJAD2.[SP]M.2SH NSTRUCTION

21/23

ARITHMETIC SCALAR FLAGS AFFECTED (ON LEAST SIGNIFICANT OPERATION) THE OPERAND RoundMode IS ONE OF THE FOLLOWING: V=MSB OF RESULT

2=1 IF ZERO RESULT IS GENERATED, 0 OTHERWISE V=1 IF AN INTEGER OVERELOW OCCURS ON EITHER OPERATION, 0 OTHERWISE C=NOTA FFECTED

RECEIL FOR ROUNDING TOWARDS POSITIVE INFINITY
REFLOOR FOR ROUNDING TOWARDS NEGATIVE INFINITY
REROUND FOR ROUNDING TO THE NEAREST INTEGER

R=TRUNC FOR ROUNDING TOWARDS ZERO

FIG. 16B





International application No. PCT/US99/23494

#### A. CLASSIFICATION OF SUBJECT MATTER

IPC(6) :G06F 7/50, 7/52, 15/16, 15/80, 17/10, 17/14 US CL :712/11, 16, 205, 708/232, 404, 524, 622

US CL:712/11, 16, 205; 708/232, 404, 524, 622
According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED

Minimum documentation searched (classification system followed by classification symbols)

U.S.: 712/11, 16, 205; 708/232, 404, 524, 622

Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched

Electronic data base consulted during the international search (name of data base and, where practicable, search terms used) STN/CAS

### C. DOCUMENTS CONSIDERED TO BE RELEVANT

| Category* | Citation of document, with indication, where appropriate, of the relevant passages | Relevant to claim No. |
|-----------|------------------------------------------------------------------------------------|-----------------------|
| Y         | US 4,709,327 A (HILLIS et al) 24 November 1987, cols. 1-46.                        | 1-14, 19-34, 36-39    |
| Y         | US 5,367,692 A (EDELMAN) 22 November 1994, cols. 1-22.                             | 1-14, 19-20           |
| Y         | US 5,659,785 A (PECHANEK et al) 19 August 1997, cols. 1-18.                        | 15, 16, 35            |
| Y         | US 5,649,135 A (PECHANEK et al) 15 July 1997, cols. 1-20.                          | 15, 16, 35            |
| Y         | US 5,682,491 A (PECHANEK et al) 28 October 1997, cols. 1-18.                       | 15, 16, 35            |
| Y         | US 5,428,754 A (BALDWIN) 27 June 1995, cols. 1-122.                                | 17, 18                |
|           |                                                                                    |                       |
|           |                                                                                    |                       |
|           |                                                                                    |                       |

| Δ      | retiner documents are instead in the continuation of Box C                                                                                                                | ~ Ш     | See patent rathery annex.                                                                                                                                                                       |
|--------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ٠,٠    | Special categories of cited documents<br>document defining the general state of the art which is not considered<br>to be of particular relevance                          | •т•     | later document published after the international filing date or priority date and not in conflict with the application but sited to understand the principle or theory underlying the invention |
| ·E•    | earlier document published on or after the international filling date                                                                                                     | .x.     | document of particular relevance; the claimed invention cannot be<br>considered novel or cannot be considered to involve an inventive step                                                      |
| ·L·    | document which may throw doubts on priority claum(s) or which is<br>cited to establish the publication date of another citation or other<br>special reason (as specified) | .4.     | when the document is taken alone<br>document of particular relovance, the claimed invention cannot be<br>considered to proble an inventive step when the document is                            |
| .0.    | document referring to an oral declosure, use, exhibition or other means                                                                                                   |         | combined with one or more other such documents, such combination<br>being obvious to a person skilled in the art                                                                                |
| *p*    | document published prior to the international filing date but later than the priority date clauned                                                                        | *&*     | document member of the suns patent family                                                                                                                                                       |
|        | of the actual completion of the international search                                                                                                                      | Date of | mailing of the international search report 7 FEB 2000                                                                                                                                           |
| Com    | and mailing address of the ISA/US<br>missioner of Patents and Trademarks<br>PCT<br>hington, D.C. 20231                                                                    | Authori | red officer Suggina Joyan                                                                                                                                                                       |
| Facsin | nile No. (703) 305-3230                                                                                                                                                   | Telepho | ne No. (703) 305-8498                                                                                                                                                                           |

International application No. PCT/US99/23494

| Category* | Citation of document, with indication, where appropriate, of the relevant passages | Relevant to claim N |
|-----------|------------------------------------------------------------------------------------|---------------------|
|           |                                                                                    |                     |
| •         | US 5,179,530 A (GENUSOV et al) 12 January 1993, cols. 1-20.                        | 17, 18              |
|           | US 5,798,753 A (ZHOU et al) 25 August 1998, cols. 1-18.                            | 21-34, 36-39        |
|           | 03 3,728,733 A (21100 et al) 23 August 1998, cois. 1-18.                           | 21-34, 30-37        |
|           |                                                                                    | J                   |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    | 1                   |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
| 1         |                                                                                    |                     |
|           |                                                                                    |                     |
| - 1       |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
| - 1       |                                                                                    | )                   |
|           |                                                                                    |                     |
|           |                                                                                    |                     |
| }         |                                                                                    |                     |
| 1         |                                                                                    | }                   |
| 1         |                                                                                    |                     |
| )         |                                                                                    |                     |
| }         |                                                                                    |                     |
| )         |                                                                                    |                     |
| )         |                                                                                    |                     |
| )         |                                                                                    |                     |
| (         |                                                                                    | l .                 |

International application No. PCT/US99/23494

| Box I Observations where certain claims were found unsearchable (Continuation of item 1 of first sheet)                                                                                                                         |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| This international report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons:                                                                                               |
| Claims Nos.:     because they relate to subject matter not required to be searched by this Authority, namely:                                                                                                                   |
| Claims Nos.:     because they relate to parts of the international application that do not comply with the prescribed requirements to such an extent that no meaningful international search can be carried out, specifically:  |
| Claims Nos.:     because they are dependent claims and are not drafted in accordance with the second and third sentences of Rule 6.4(a).                                                                                        |
| Box II Observations where unity of invention is lacking (Continuation of item 2 of first sheet)                                                                                                                                 |
| This International Searching Authority found multiple inventions in this international application, as follows:                                                                                                                 |
| Please See Extra Sheet.                                                                                                                                                                                                         |
|                                                                                                                                                                                                                                 |
|                                                                                                                                                                                                                                 |
|                                                                                                                                                                                                                                 |
|                                                                                                                                                                                                                                 |
| 1. X As all required additional search fees were timely paid by the applicant, this international search report covers all searchable claims.                                                                                   |
| 2. As all searchable claims could be searched without effort justifying an additional fee, this Authority did not invite payment                                                                                                |
| of any additional fee.     As only some of the required additional search fees were timely paid by the applicant, this international search report covers only those claims for which fees were paid, specifically claims Noz.: |
|                                                                                                                                                                                                                                 |
| No required additional search fees were timely paid by the applicant. Consequently, this international search report is restricted to the invention first mentioned in the claims; it is covered by claims Nos.:                |
| Remark on Protest The additional search fees were accompanied by the applicant's protest.                                                                                                                                       |
| No protest accompanied the payment of additional search fees.                                                                                                                                                                   |

International application No. PCT/US99/23494

BOX II. OBSERVATIONS WHERE UNITY OF INVENTION WAS LACKING This ISA found multiple inventions as follows:

This application contains the following inventions or groups of inventions which are not so linked as to form a single inventive concept under PCT Rule 13.1. In order for all inventions to be searched, the appropriate additional search fees must be paid.

Group I, claims 1-14 and 19-20, drawn to an array processor for performing multiplications on complex numbers;

Group II, claim 15, drawn to an array processor for computing an FFT using twiddle factors;

Group III, claim 16, drawn to an array processor for computing a distributed FFT using twiddle factors, multiplication, adding or substractine:

Group IV, claims 17-18, drawn to an array processor for multiplying specific subparts of complex numbers and twiddle factors;

Group V, claims 21-34, 36-39, drawn to fetching complex multiplication instructions and dispatching of the instructions: and

Group VI, claim 35, drawn to an array processor for performing officient processing of an FFT.

The inventions listed as Groups I-VI do not relate to a single inventive concept under PCT Rule 13.1 because, under PCT Rule 13.2, they lack the same or corresponding special technical features for the following reasons. Group I lacks the particulars of the calculation of an FTT using widdle factors of Group II; the calculation of a distributed FTT using widdle factors, multiplication, adding or subtracting; the computation of multiplying specific subparts of complex numbers and widdle factors; the facting and dispatching of complex numblification instructions; and for performing efficient processing of an FTT. As per Groups II to Group VI, they lack the same or corresponding special technical features with present to each other for similar reasons as stated above.